Keyword dedup: merge action, duplicate-keywords health check, whitespace ward (#1352) #1361

Merged: jonfroehlich merged 3 commits into master from 1352-merge-keywords on Jun 21, 2026
@jonfroehlich

Member

Implements #1352 — the destructive keyword-merge tooling deferred from the #1346 Phase 4 audit — plus the finder and a layer-1 prevention ward.

What

Finder — "Duplicate keywords" data-health check. Read-only check that clusters keywords sharing a normalized key (strip + casefold) and surfaces per-model usage counts, so the editor can see which variant to keep. Each row deep-links ("Merge in admin →") to the Keyword changelist pre-filtered (?q=<cluster key>) to that cluster.
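A minimal sketch of the clustering idea, assuming the normalized key is exactly `strip()` followed by `casefold()` as stated above. The function names are hypothetical and this runs on plain strings rather than the actual `Keyword` queryset; per-model usage counts are omitted.

```python
from collections import defaultdict


def normalized_key(keyword: str) -> str:
    """Normalize as described: strip the ends, then casefold."""
    return keyword.strip().casefold()


def find_duplicate_clusters(keywords: list[str]) -> dict[str, list[str]]:
    """Group keywords by normalized key; keep only groups with 2+ variants."""
    clusters: dict[str, list[str]] = defaultdict(list)
    for kw in keywords:
        clusters[normalized_key(kw)].append(kw)
    # A cluster is only interesting if more than one variant shares the key.
    return {key: variants for key, variants in clusters.items() if len(variants) > 1}


clusters = find_duplicate_clusters(["Speech", "speech", "Speech ", "VR", "HCI"])
```

Here `VR` and `HCI` each form a singleton group and are dropped, so only the `speech` cluster is reported.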

Fixer — "Merge selected keywords" admin action. Select 2+ keywords on the Keyword changelist → intermediate confirmation page to pick the target → reassigns every reference across all six keyword-holding models (Publication/Talk/Poster/Grant/Project/ProjectUmbrella) onto the target, then deletes the rest. Reattach is via obj.keywords.add(target) (idempotent — an object already tagged with the target gains no duplicate row) inside a transaction; the source's deletion drops its own M2M rows.
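The reassign-then-delete logic can be modeled in memory to show why an idempotent add avoids duplicate rows. This is an illustrative sketch only: Python sets stand in for the M2M relations, `merge_keywords` is a hypothetical name, and the way the target is excluded from the sources is an assumption about what a target-in-sources guard might do, not the PR's actual admin action.

```python
def merge_keywords(objects: dict[str, set[str]], sources: set[str], target: str) -> int:
    """Reassign every reference from `sources` onto `target`.

    `objects` maps an object label to its set of keyword names.
    Returns the number of objects that were touched.
    """
    # Assumed guard: the target must never be removed as one of the sources.
    sources = sources - {target}
    touched = 0
    for keywords in objects.values():
        if keywords & sources:
            keywords.add(target)                 # no-op if the target is already present
            keywords.difference_update(sources)  # stands in for deleting the source rows
            touched += 1
    return touched


tagged = {
    "pub-1": {"speech", "VR"},
    "talk-1": {"Speech", "speech"},  # already tagged with the target
    "poster-1": {"HCI"},
}
touched = merge_keywords(tagged, sources={"speech", "Speech"}, target="Speech")
```

`talk-1` already carried the target, so it ends with a single `Speech` entry rather than two, which is the dedup case the description calls out.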

Layer-1 ward — whitespace normalization on save. Keyword.save() trims ends and collapses internal whitespace runs, so "Speech " / " Speech" can't coexist with their clean forms. Catches every creation path including the inline "add keyword" widget. No migration, casing preserved (VR, HCI, iOS).
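One common way to get this behavior in Python is `" ".join(value.split())`. The sketch below assumes that approach; the helper name is hypothetical and the actual body of `Keyword.save()` is not shown in this PR text.

```python
def normalize_whitespace(value: str) -> str:
    """Trim both ends and collapse internal whitespace runs to one space.

    str.split() with no arguments splits on any run of whitespace and
    discards leading/trailing whitespace, so re-joining with a single
    space performs both steps at once. Casing is left untouched.
    """
    return " ".join(value.split())


cleaned = [normalize_whitespace(s) for s in ["Speech ", " Speech", "Speech  recognition", "iOS"]]
```

A `save()` override would apply such a helper to the keyword field before calling `super().save(...)`, which is why every creation path is covered.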

How to use

Admin → Configuration → Keywords → tick 2+ → Action dropdown → Merge selected keywords → pick target → Merge. Or start from Data Health → Duplicate keywords and click Merge in admin →.

Scope

Admin-only (not Pa11y-scanned); no schema or migration changes. The only model-level change is the Keyword.save() whitespace normalization. The merge UI is the standard Django intermediate-action confirmation page.

Layer 2 (follow-up, not in this PR)

Case-insensitive uniqueness (blocking Speech vs speech) is a DB UniqueConstraint(Lower('keyword')) — a migration that will fail to apply while dupes exist, so it must come after a prod dedup pass using this tool. Tracked separately.
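The ordering constraint can be demonstrated with the standard-library `sqlite3` module. This is a sketch under assumptions: it uses an in-memory SQLite table and a raw `CREATE UNIQUE INDEX` on `lower(keyword)` in place of a Django migration, and the table, index, and function names are hypothetical. The production database may behave differently in detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE keyword (id INTEGER PRIMARY KEY, keyword TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO keyword (keyword) VALUES (?)",
    [("Speech",), ("speech",), ("VR",)],
)


def add_case_insensitive_unique_index(conn: sqlite3.Connection) -> bool:
    """Try to add a unique index on lower(keyword); report whether it applied."""
    try:
        conn.execute("CREATE UNIQUE INDEX keyword_lower_uniq ON keyword (lower(keyword))")
        return True
    except sqlite3.IntegrityError:
        # Existing case variants violate the constraint, so it cannot be created.
        return False


applied_before_dedup = add_case_insensitive_unique_index(conn)

# Stand-in for the merge action: remove the case variant, keep "Speech".
conn.execute("DELETE FROM keyword WHERE keyword = 'speech'")
applied_after_dedup = add_case_insensitive_unique_index(conn)
```

The first attempt fails because `Speech` and `speech` collide under `lower()`; once the variant is gone the same statement succeeds.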

Test

python manage.py test website.tests.test_keyword_merge website.tests.test_data_health --settings=makeabilitylab.settings_test — green (incl. all-six-relation reassignment, dedup, single-selection no-op, confirm round-trip, whitespace normalization, finder clustering, and an end-to-end deep-link render).

🤖 Generated with Claude Code

jonfroehlich and others added 3 commits June 20, 2026 16:41
… check (#1352)

Keywords are free-text with no uniqueness constraint, so case/whitespace
variants (Speech / speech / Speech ) coexist and fragment the public keyword
pages. This adds both halves of the cleanup loop:

Finder — a read-only "Duplicate keywords" data-health check that clusters
keywords sharing a normalized key (strip + casefold) and surfaces per-model
usage counts, so the editor can see which variant to keep.

Fixer — a destructive "Merge selected keywords" admin action on the Keyword
changelist. Select 2+ keywords -> intermediate confirmation page to pick the
target -> reassigns every reference across all six keyword-holding models
(Publication/Talk/Poster/Grant/Project/ProjectUmbrella) onto the target, then
deletes the rest. Reattach is via obj.keywords.add(target) (idempotent, so an
object already tagged with the target gains no duplicate row) inside a
transaction; the source's deletion drops its own M2M rows.

Tests cover reassignment across all six relations, source deletion, the
no-duplicate dedup case, the target-in-sources guard, the single-selection
no-op, the confirm-POST round trip, and the finder's clustering.

Admin-only; no model or migration changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Layer-1 ward against near-duplicate keywords: Keyword.save() now trims the
ends and collapses internal whitespace runs to a single space, so "Speech ",
" Speech", and "Speech  recognition" can't coexist with their clean forms.
Catches every creation path, including the inline "add keyword" widget on
Publication/Project forms. No migration, no data cleanup needed.

Casing is intentionally preserved (VR, HCI, iOS). Case-insensitive uniqueness
(blocking "Speech" vs "speech") is the separate layer-2 DB constraint, deferred
until existing prod dupes are merged with the new action — the data-health
finder's job is now exactly that remaining case-variant class.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…st (#1352)

Wires the finder to the fixer. The "Duplicate keywords" detail page now shows a
per-row "Merge in admin →" link that opens the Keyword changelist pre-filtered
(?q=<cluster key>) to exactly that cluster's variants, so the editor can
select-all and run the merge action instead of re-finding them by hand.

Implemented as an opt-in HealthCheck.row_link((label, url)) hook: the detail
view adds an Action column only when a check provides links, so the other nine
checks and the CSV export are unchanged. Covered by a row_link URL test and an
end-to-end superuser render test asserting the link is in the page.
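A rough sketch of how an opt-in hook like this could be shaped. Everything here is an assumption made for illustration: the exact `row_link` signature, the row structure, the `table_headers` helper, and the changelist path `/admin/website/keyword/` are hypothetical and not taken from the PR diff.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumed changelist path; the real URL would normally come from Django's reverse().
KEYWORD_CHANGELIST = "/admin/website/keyword/"


class HealthCheck:
    def row_link(self, row: dict) -> Optional[tuple[str, str]]:
        """Opt-in hook: return (label, url) for a row, or None for no link."""
        return None


class DuplicateKeywordsCheck(HealthCheck):
    def row_link(self, row: dict) -> Optional[tuple[str, str]]:
        # Pre-filter the changelist to this cluster via the ?q= search parameter.
        query = urlencode({"q": row["key"]})
        return ("Merge in admin →", f"{KEYWORD_CHANGELIST}?{query}")


def table_headers(check: HealthCheck, rows: list[dict], base_headers: list[str]) -> list[str]:
    """Add an Action column only when the check supplies a link for some row."""
    has_links = any(check.row_link(row) is not None for row in rows)
    return base_headers + ["Action"] if has_links else list(base_headers)


rows = [{"key": "speech recognition"}]
link = DuplicateKeywordsCheck().row_link(rows[0])
with_action = table_headers(DuplicateKeywordsCheck(), rows, ["Key", "Variants"])
without_action = table_headers(HealthCheck(), rows, ["Key", "Variants"])
```

Because the base class returns `None`, checks that do not override the hook render the same columns as before.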

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jonfroehlich jonfroehlich merged commit 439a7c7 into master Jun 21, 2026
3 checks passed
@jonfroehlich jonfroehlich deleted the 1352-merge-keywords branch June 22, 2026 20:36